 cancer subtype discovery


Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data

Neural Information Processing Systems

Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count data are often overdispersed, requiring appropriate modeling. Second, compared to the number of involved molecules and system complexity, the number of available samples for studying complex diseases such as cancer is often limited, especially considering disease heterogeneity. The key question is whether we can integrate available data from all different sources or domains to achieve reproducible disease prognosis based on NGS count data. In this paper, we develop a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping, even if the number of samples for a specific cancer type is small. Experimental results from both simulated data and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without the "negative transfer" effects often seen in existing multi-task learning and transfer learning methods.
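To make the term "overdispersed" concrete, the following is a minimal sketch of why a negative binomial is a common choice for such counts. It is not the BMDL model itself: it only uses the standard fact that a Poisson count whose rate is gamma-distributed follows a negative binomial distribution, with variance mean + mean^2 / dispersion instead of the Poisson variance equal to the mean. The function names, the mean of 10, and the dispersion of 2 are illustrative assumptions, not values from the paper.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) count with Knuth's method (adequate for modest rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mean, dispersion, rng):
    """Draw one count from a gamma-Poisson mixture (hypothetical helper).

    The Poisson rate is Gamma(shape=dispersion, scale=mean / dispersion),
    so the resulting count has variance mean + mean**2 / dispersion.
    """
    rate = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(rate, rng)


rng = random.Random(0)
MEAN, DISPERSION, N = 10.0, 2.0, 20000  # illustrative values only

poisson_counts = [sample_poisson(MEAN, rng) for _ in range(N)]
nb_counts = [sample_negative_binomial(MEAN, DISPERSION, rng) for _ in range(N)]

# Both samples share roughly the same mean, but the negative binomial
# sample has a much larger variance (theory: 10 + 100 / 2 = 60 versus 10).
print("Poisson  mean/var:", statistics.fmean(poisson_counts), statistics.variance(poisson_counts))
print("NegBinom mean/var:", statistics.fmean(nb_counts), statistics.variance(nb_counts))
```

A plain Poisson model fitted to the second sample would understate its variability, which is the modeling issue the abstract refers to.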


Reviews: Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data

Neural Information Processing Systems

I am not familiar enough with the domain to really assess the novelty of the contribution. From my reading, this is a "model - prior - Gibbs sampler" paper that seems to improve the classification scores but does not provide a breakthrough for the learning community. The novelty essentially seems to come from the choice of the prior, which allows but does not require factors to be shared across domains. Moreover, the authors state that details on their Gibbs sampler are provided in the supplementary materials, but I can only trust them, as there seems to have been a mistake when uploading the supplementary materials (the actual manuscript was submitted instead). I would have liked some intuition on how to choose the parameter K and how its value affects the results, both in terms of subtyping and in terms of complexity. In general, how does the approach compare with competitors in terms of run-time? In the case study section, how many subtypes of lung cancer were considered? Have the authors tried their approach with more than two domains?


Parea: multi-view ensemble clustering for cancer subtype discovery

arXiv.org Artificial Intelligence

Multi-view clustering methods are essential for the stratification of patients into sub-groups of similar molecular characteristics. In recent years, a wide range of methods has been developed for this purpose. However, due to the high diversity of cancer-related data, a single method may not perform sufficiently well in all cases. We present Parea, a multi-view hierarchical ensemble clustering approach for disease subtype discovery. We demonstrate its performance on several machine learning benchmark datasets. We apply and validate our methodology on real-world multi-view cancer patient data. Parea outperforms the current state-of-the-art on six out of seven analysed cancer types. We have integrated the Parea method into our Python package Pyrea (https://github.com/mdbloice/Pyrea), which enables the effortless and flexible design of ensemble workflows while incorporating a wide range of fusion and clustering algorithms.

